BaconDecomposition R parity goldens#457
Conversation
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R1 verdict was Looks good with 2 P3 informational items. Both addressed: 1. P3 (Documentation/Tests): `bacon_decompose()` docstring example said "matches R bacondecomp::bacon() at atol=1e-6" without mentioning the documented always-treated convention exception. Qualified the example to spell out the aggregate-vs-per-component split: aggregate parity holds for all panels at atol=1e-6, per-component parity holds when first_treat is bounded below by min(time) (no always-treated), and the divergence on always-treated panels is by convention (Python remap-to-U vs R's `Later vs Always Treated`). Cross-references the REGISTRY note for the full contract. 2. P3 (Documentation/Tests): `TestBaconParityR`'s skip message still said the goldens were "deferred until R is provisioned (see TODO.md)" but the TODO row was removed in this PR. Updated to describe the intended skip case (partial-checkout / packaging scenarios where the committed JSON is unavailable) and dropped the TODO reference. Tests unchanged: 33/33 pass in test_methodology_bacon.py.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R2 verdict was Looks good with 1 P3 informational item. METHODOLOGY_REVIEW.md Test Coverage line read "all active; R parity activates once goldens are committed" - stale after this PR commits the goldens and activates the 3 R-parity tests. Updated to reflect the post-PR state: all 33 tests active including R-parity (with pointer to the committed JSON).
Closes the PR #454 deferred R parity follow-up (TODO.md row removed). Generated `benchmarks/data/r_bacondecomp_golden.json` from the committed `benchmarks/R/generate_bacon_golden.R` script against `bacondecomp 0.1.1` on R 4.5.2. Three DGP fixtures: `uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`. Parity results at atol=1e-6 via `tests/test_methodology_bacon.py::TestBaconParityR`: - TWFE coefficient: ✅ matches across all 3 fixtures - Weights-sum: ✅ matches across all 3 fixtures - Per-component: ✅ on the 2 non-remap fixtures; **structural convention divergence** on `always_treated_remapped` (skipped per-component, kept aggregate). R keeps `first_treat=1` as a distinct timing cohort and emits `Later vs Always Treated` comparisons; Python's paper-footnote-11 convention remaps those units to `U` and folds them into a single `treated_vs_never` cell per treated cohort. The aggregate is invariant per Theorem 1 — the U bucket's weight is re-allocated across nested 2x2 cells but the total weight on {cohort_k vs U} is identical. Only the per-component breakdown differs structurally between conventions. Tracker promotions: - METHODOLOGY_REVIEW.md: BaconDecomposition status row → **Complete** (was `**Complete** (R parity pending)`); removed from In Progress prose mention; removed from Priority Order substantive-review list; Test Coverage count refreshed (24 → 33); R Comparison Results block rewritten as **Validated**. - docs/methodology/REGISTRY.md: Reference Implementations bullet + Verified Components checklist + Note (weight modes) updated; new Note (R parity convention divergence on always-treated) documents the convention. - TODO.md: BaconDecomposition R parity goldens row removed. - CHANGELOG.md: new `[Unreleased]` Added bullet for the close-out; PR-B Changed entry tightened ("intended to match" → "matching ... at atol=1e-6"). - diff_diff/bacon.py: `bacon_decompose` docstring example wording tightened from "intended to match" to "matches" with TestBaconParityR pointer. Tests: 33/33 pass in test_methodology_bacon.py (no skips; was 30+3 skipped); 32 pass in test_bacon.py; 101 pass across the broader bacon/decompose surface (was 98+3 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R1 verdict was Looks good with 2 P3 informational items. Both addressed: 1. P3 (Documentation/Tests): `bacon_decompose()` docstring example said "matches R bacondecomp::bacon() at atol=1e-6" without mentioning the documented always-treated convention exception. Qualified the example to spell out the aggregate-vs-per-component split: aggregate parity holds for all panels at atol=1e-6, per-component parity holds when first_treat is bounded below by min(time) (no always-treated), and the divergence on always-treated panels is by convention (Python remap-to-U vs R's `Later vs Always Treated`). Cross-references the REGISTRY note for the full contract. 2. P3 (Documentation/Tests): `TestBaconParityR`'s skip message still said the goldens were "deferred until R is provisioned (see TODO.md)" but the TODO row was removed in this PR. Updated to describe the intended skip case (partial-checkout / packaging scenarios where the committed JSON is unavailable) and dropped the TODO reference. Tests unchanged: 33/33 pass in test_methodology_bacon.py.
R2 verdict was Looks good with 1 P3 informational item. METHODOLOGY_REVIEW.md Test Coverage line read "all active; R parity activates once goldens are committed" - stale after this PR commits the goldens and activates the 3 R-parity tests. Updated to reflect the post-PR state: all 33 tests active including R-parity (with pointer to the committed JSON).
a9e3c64 to
86facdd
Compare
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R3 verdict was Looks good with 1 P3 informational item. The per-component parity test skips the `always_treated_remapped` fixture (R/Python decompose the U bucket differently by convention), and the REGISTRY note documents that aggregating R's `Later vs Always Treated` + `Treated vs Untreated` rows by treated cohort should match Python's single `treated_vs_never` component for that cohort. The reviewer flagged that the documented structural claim was not directly asserted in tests — a cohort-level regression in the fold-back could slip through under overall TWFE parity. Per memory `feedback_test_coverage_gap_treat_as_actionable`, the "test exists but doesn't directly exercise the documented surface" P3 is actionable. Added `test_always_treated_remapped_fold_back_matches_r` to `TestBaconParityR`: for each treated cohort in the remap fixture, aggregate R's `Later vs Always Treated` + `Treated vs Untreated` rows by combined weight and weight-averaged estimate, then assert both match Python's `treated_vs_never` component for that cohort at atol=1e-6. Currently passes — confirms the documented structural fold-back is exact at numerical precision. Tests: 34/34 pass in test_methodology_bacon.py (was 33; +1 new regression).
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R4 verdict was Looks good with 1 P3 informational item: the per-component parity test skipped the ENTIRE always_treated_remapped fixture, leaving the 6 timing-vs-timing rows (Earlier/Later vs Earlier/Later Treated between cohorts 3/4/5) without direct per-component parity assertions. Per memory feedback_test_coverage_gap_treat_as_actionable, this is the "test exists but doesn't directly exercise the surface" pattern and should be actionable. Narrowed the carve-out: instead of skipping the whole fixture, drop only the treated_vs_never keys from both Python and R sides (the actual U-bucket convention divergence), and keep direct atol=1e-6 parity assertions on the 6 timing-vs-timing keys. Also refined _classify_r_type to canonicalize R's "Later vs Always Treated" type string to treated_vs_never (Python folds those rows into the U bucket per paper footnote 11, so they belong to the U comparison set semantically even though R numbers them by the always-treated cohort), keeping the narrow carve-out simple. Tests: 34/34 pass in test_methodology_bacon.py (+6 directly asserted timing-vs-timing comparisons in the remap fixture vs prior coverage).
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology No findings. The only methodology deviation in scope is the always-treated R/Python convention split, and it is documented in docs/methodology/REGISTRY.md and docs/methodology/REGISTRY.md. Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No findings. Security No findings. Documentation/Tests
|
R5 verdict was Looks good with 1 P3 informational item: docs prose out of sync with the actual parity harness after R4's carve-out narrowing. CHANGELOG, REGISTRY, METHODOLOGY_REVIEW, and the fold-back test's own docstring still said TestBaconParityR has 3 tests and that always_treated_remapped is skipped for per-component parity, while the code now adds a fourth fold-back test and only carves out treated_vs_never rows while keeping direct parity on the 6 timing- vs-timing rows of that fixture. Refreshed 6 surfaces: - METHODOLOGY_REVIEW.md Verified Components checklist + Test Coverage count (33 -> 34) + R Comparison Results subsection for the remap fixture. - docs/methodology/REGISTRY.md Reference Implementations bullet, Requirements checklist, and Note (R parity convention divergence) text to reflect the narrowed carve-out and the fold-back test. - CHANGELOG.md PR-457 Added entry (4 tests, narrowed carve-out description, fold-back test mention). - tests/test_methodology_bacon.py::test_always_treated_remapped_fold_back_matches_r docstring (no longer says the per-component test "skips" the fixture; says it carves out only the U-bucket rows). Tests: 34/34 pass.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology No findings. The only deviation in scope is the always-treated R/Python decomposition convention, and it is explicitly documented in the registry note, so under the stated review policy it is informational rather than a defect at docs/methodology/REGISTRY.md:2681 and docs/methodology/REGISTRY.md:2686. Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No findings. Removing the Bacon parity deferral from Security No findings. Documentation/Tests
|
…e-out R6 verdict was Looks good with 1 P3 informational item: the older PR-B audit bullet at CHANGELOG.md:13 (added in PR #454) still described the pre-goldens deferral state ("JSON goldens deferred", "TestBaconParityR skips with a pointer when goldens missing", "status flipped to **Complete (R parity goldens pending)**"). That contradicts the new PR-457 bullet at CHANGELOG.md:11 (committed goldens + 4 active parity tests) within the same [Unreleased] section, so the release notes read as internally inconsistent. Updated 3 strings in the PR-B bullet to reflect the within-release close-out: - Status flip wording: now says the (R parity pending) caveat was closed by the parity-goldens bullet above in this same release. - TestBaconParityR description: 4 tests, all active post-release; skips only in partial-checkout scenarios. - (4) outcome: parity goldens deferral was closed within this release.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No findings. Removing the Bacon R-parity TODO row is consistent with the committed golden file and the new parity-test surface. Security No findings. Documentation/Tests
Execution note: I was unable to run the Bacon methodology suite in this environment because Path to Approval
|
R7 surfaced a real P1: the REGISTRY presented the always-treated remap condition `first_treat <= min(time)` as "per paper footnote 11", but the paper's strict rule is `t_i < 1` (units treated *before* the first observable period). The inclusive `<= min(time)` rule additionally folds `first_treat == min(time)` cohorts into U — that's a library boundary convention, not a paper-faithful rule. The test class docstring already called this out, but the authoritative REGISTRY contract did not, which read as an undocumented methodology deviation on PRs that promote BaconDecomposition to **Complete**. Resolution: added a new explicit `**Deviation (first-period boundary extension on always-treated remap)**` block to REGISTRY's Bacon section that: - Names the paper's strict `t_i < 1` rule - States the library's inclusive `<= min(time)` rule - Explains the rationale (`first_treat == min(time)` cohorts have no untreated cell in-panel) - Notes R does NOT apply this fold (it keeps such cohorts in their own bucket and emits `Later vs Always Treated`) - Notes the rules coincide when `min(time) > 1` Mirrored in: - REGISTRY Assumption checks bullet (line 2619): now points at the new Deviation block - REGISTRY `**Note (always-treated remap)**` (line 2684): qualifies the "per paper footnote 11" claim - METHODOLOGY_REVIEW.md Deviations block: re-titled to include paper deviations, added the boundary entry as item 1 - `bacon_decompose()` docstring (`bacon.py:467-487`): explicit boundary-extension paragraph with REGISTRY pointer - CHANGELOG PR-457 Added entry: explicit boundary-deviation callout Also fixes R7 P3: CHANGELOG PR-B test count "~24 tests" updated to acknowledge the post-release 34-test count after R-parity-goldens expansion. Tests: 34/34 pass.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0 or P1 findings. Executive Summary
Methodology No findings. The prior undocumented-deviation issue is resolved by the new explicit deviation language in Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No findings. Removing the Bacon R-parity deferral from Security No findings. I did not see any secrets or sensitive data concerns in the added golden file. Documentation/Tests
|
R8 verdict was Looks good with 1 P3 informational item: the METHODOLOGY_REVIEW.md Priority Order list jumped from item 12 to item 14 after PR #457 removed Bacon (the prior item 1). Renumbered Survey Data Support from 14 to 13 to close the gap.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0 or P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R9 verdict was Looks good with 1 P3 informational item: the bacon_decompose docstring example said per-component parity holds "when first_treat is bounded below by min(time)" — that's the inverse of the correct condition. Under the library's `<= min(time)` remap rule, always-treated panels are exactly the ones WITH at least one non-sentinel `first_treat <= min(time)`, so per-component parity holds when all non-sentinel cohorts have `first_treat > min(time)` (i.e. bounded *above* by min(time), or equivalently "no first- period-treated cohorts"). Rephrased. Tests: 34/34 pass; no behavior change (docstring-only fix).
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
…meta
Fresh R10 verdict was Looks good with 2 P3 informational items:
1. P3 (Maintainability): the always-treated fold-back test selected
R rows via case-sensitive literal substrings ("Untreated",
"Always Treated", "Later"), while the neighboring _classify_r_type
classifier uses case-insensitive semantic matching. Made the
selector consistent — case-insensitive matching on "untreated" /
"never" / "always" tokens, so the fold-back survives bacondecomp
label variation across versions.
2. P3 (Documentation/Tests): committed golden JSON's meta.description
still advertised full per-component (treated, control, type) tuple
parity as the contract, but PR #457 intentionally replaces that for
the always_treated_remapped U-bucket rows with aggregate +
fold-back parity. Updated meta.description to describe the actual
three-tier contract (aggregate / direct per-component on
non-remap + 6 timing-vs-timing rows / cohort fold-back for U
bucket) with a pointer to the REGISTRY Notes that document the
convention divergence.
Tests: 34/34 still pass.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology No findings. The changed registry/review text is consistent with the paper review’s footnote-11 distinction ( Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt No findings. Security No findings. Documentation/Tests
|
…late R11 verdict was Looks good with 1 P3 informational item: I had updated the committed JSON's meta.description in R10 to describe the narrowed contract, but the R generator script at benchmarks/R/generate_bacon_golden.R still had the old "atol=1e-6 on per-component (treated, control, type) tuples plus TWFE coefficient" description in BOTH (a) its header docstring (lines 8-22) AND (b) its meta.description value template (lines 218-225). Re-running the script would have overwritten my committed JSON polish with the old contradictory description. Updated both surfaces to the three-tier contract: (1) aggregate TWFE + weights-sum on all 3 fixtures; (2) direct per-component parity on the 2 non-remap fixtures + 6 timing-vs-timing rows of always_treated_remapped; (3) cohort fold-back parity for the U bucket on always_treated_remapped. Pointers to REGISTRY Note (R parity convention divergence on always-treated) + Deviation (first- period boundary extension). Re-ran the R script; JSON written matches the committed text and tests remain green (4/4 in TestBaconParityR, 34/34 across the file). Script is now idempotent on its own committed output.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R12 verdict was Looks good with 1 P3 informational item: the fixture-3 inline comment in benchmarks/R/generate_bacon_golden.R still described the old contract — said R "natively groups first_treat=1 with U" (wrong; R keeps them as a distinct cohort and emits `Later vs Always Treated`) and said "30 never-treated" (wrong; the script builds 25 never-treated). The header docstring + meta.description template were updated in R11, but this inline block-comment slipped. Rewrote the inline comment to match: (a) the actual fixture construction (5 always-treated, 25 never-treated, 3 timing cohorts at times 3/4/5); (b) the correct R behavior (separate cohort, separate `Later vs Always Treated` rows); (c) pointers to REGISTRY note + deviation block; (d) what the parity tests carve out vs fold-back.
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Residual risk: source inspection only; I could not execute |
…eased] CHANGELOG conflict (PR #457 BaconDecomposition R parity goldens) # Conflicts: # CHANGELOG.md
Summary
benchmarks/data/r_bacondecomp_golden.jsonfrom the committedbenchmarks/R/generate_bacon_golden.Rscript againstbacondecomp 0.1.1on R 4.5.2 (3 DGP fixtures).tests/test_methodology_bacon.py::TestBaconParityRnow active (3/3 pass, no skips): TWFE coefficient parity + weights-sum parity atatol=1e-6across all 3 fixtures; per-component estimate + weight parity atatol=1e-6on the 2 non-remap fixtures.always_treated_remapped: R keepsfirst_treat=1as a distinct timing cohort (Later vs Always Treatedrows); Python's paper-footnote-11 convention remaps those units toUand folds them into a singletreated_vs_nevercell per treated cohort. Aggregate is invariant per Theorem 1; per-component breakdown differs. Per-component test skipped on this fixture with explicit documentation; aggregate parity locked.METHODOLOGY_REVIEW.mdstatus row →**Complete**(was**Complete** (R parity pending)). Removed from In Progress prose mention + Priority Order list.Methodology references (required if estimator / math changes)
bacondecomp::bacon()(CRAN).**Note (R parity convention divergence on always-treated)**. The aggregate TWFE coefficient + weights-sum match R atatol=1e-6; only the per-component U-bucket decomposition differs (R splits always-treated as separate type; Python remaps to U per paper footnote 11). Theorem 1's identity is invariant to this re-bucketing.Validation
tests/test_methodology_bacon.py(per-component test now skips thealways_treated_remappedfixture with explicit reason; aggregate tests unchanged). 33/33 in methodology bacon file (was 30+3 skipped); 32 intest_bacon.py; 101 across broader bacon/decompose surface (was 98+3 skipped).Security / privacy
Generated with Claude Code